Measuring inter-rater reliability for nominal data – which coefficients and confidence intervals are appropriate?

Authors

  • Antonia Zapf
  • Stefanie Castell
  • Lars Morawietz
  • André Karch
Abstract

BACKGROUND: Reliability of measurements is a prerequisite of medical research. For nominal data, Fleiss' kappa (labelled Fleiss' K in the following) and Krippendorff's alpha are the most flexible of the available reliability measures with respect to the number of raters and categories. Our aim was to investigate which measures and which confidence intervals have the best statistical properties for assessing inter-rater reliability in different situations.

METHODS: We performed a large simulation study to investigate the precision of the estimates for Fleiss' K and Krippendorff's alpha and to determine the empirical coverage probability of the corresponding confidence intervals (asymptotic for Fleiss' K and bootstrap for both measures). Furthermore, we compared measures and confidence intervals in a real-world case study.

RESULTS: Point estimates of Fleiss' K and Krippendorff's alpha did not differ from each other in any scenario. In the case of missing data (missing completely at random), Krippendorff's alpha provided stable estimates, while the complete case analysis approach for Fleiss' K led to biased estimates. For shifted null hypotheses, the coverage probability of the asymptotic confidence interval for Fleiss' K was low, while the bootstrap confidence intervals for both measures provided a coverage probability close to the theoretical one.

CONCLUSIONS: Fleiss' K and Krippendorff's alpha with bootstrap confidence intervals are equally suitable for the analysis of reliability of complete nominal data. The asymptotic confidence interval for Fleiss' K should not be used. In the case of missing data or data of higher than nominal order, Krippendorff's alpha is recommended. Together with this article, we provide an R-script for calculating Fleiss' K and Krippendorff's alpha and their corresponding bootstrap confidence intervals.
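To make the two coefficients and the bootstrap interval concrete, here is a minimal Python sketch. It is not the authors' R-script. It assumes ratings are stored as a subjects-by-raters integer array with categories coded 0..k-1 and any other value (for example -1) marking a missing rating. The function names and the choice of a percentile bootstrap that resamples subjects are assumptions made for illustration, since the abstract does not state which bootstrap variant was used.

```python
import numpy as np

def _category_counts(ratings, n_categories):
    """counts[i, c] = number of raters who put subject i in category c."""
    ratings = np.asarray(ratings)
    return np.stack([(ratings == c).sum(axis=1) for c in range(n_categories)], axis=1)

def fleiss_kappa(ratings, n_categories):
    """Fleiss' K for complete data (every rater rates every subject)."""
    counts = _category_counts(ratings, n_categories)
    n_subjects, n_raters = np.asarray(ratings).shape
    # Mean proportion of agreeing rater pairs per subject.
    p_obs = ((counts * (counts - 1)).sum(axis=1) / (n_raters * (n_raters - 1))).mean()
    # Chance agreement from the pooled category proportions.
    p_cat = counts.sum(axis=0) / (n_subjects * n_raters)
    p_exp = (p_cat ** 2).sum()
    return (p_obs - p_exp) / (1 - p_exp)

def krippendorff_alpha_nominal(ratings, n_categories):
    """Krippendorff's alpha, nominal metric; codes outside 0..k-1 are missing."""
    counts = _category_counts(ratings, n_categories)
    m = counts.sum(axis=1)            # ratings actually present per subject
    keep = m >= 2                     # a subject needs two ratings to form a pair
    counts, m = counts[keep], m[keep]
    n = m.sum()                       # total number of pairable ratings
    agree = ((counts * (counts - 1)).sum(axis=1) / (m - 1)).sum()
    d_obs = 1 - agree / n             # observed disagreement
    n_c = counts.sum(axis=0)
    d_exp = (n ** 2 - (n_c ** 2).sum()) / (n * (n - 1))  # expected disagreement
    return 1 - d_obs / d_exp

def bootstrap_ci(ratings, statistic, n_categories, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling subjects (rows) with replacement."""
    ratings = np.asarray(ratings)
    rng = np.random.default_rng(seed)
    n_subjects = ratings.shape[0]
    estimates = []
    with np.errstate(invalid="ignore", divide="ignore"):
        for _ in range(n_boot):
            idx = rng.integers(0, n_subjects, size=n_subjects)
            estimates.append(statistic(ratings[idx], n_categories))
    tail = (1 - level) / 2
    # Degenerate resamples (a single category only) give NaN and are skipped.
    lower, upper = np.nanquantile(estimates, [tail, 1 - tail])
    return lower, upper

# Toy data: 4 subjects, 3 raters, 2 categories (hypothetical values).
toy = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 0, 1]])
print(fleiss_kappa(toy, 2), krippendorff_alpha_nominal(toy, 2))
print(bootstrap_ci(toy, krippendorff_alpha_nominal, 2, n_boot=500))
```

For complete data the two point estimates in this sketch are tied by alpha = 1 - (1 - K)(n - 1)/n, where n is the total number of ratings, so they converge as the sample grows. With missing ratings, `krippendorff_alpha_nominal` keeps every subject that has at least two ratings, whereas `fleiss_kappa` would need incomplete rows dropped first (the complete case analysis described in the abstract).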

Similar resources

Variance Estimation of Nominal-scale Inter-rater Reliability with Random Selection of Raters

Most inter-rater reliability studies using nominal scales suggest the existence of two populations of inference: the population of subjects (collection of objects or persons to be rated) and that of raters. Consequently, the sampling variance of the inter-rater reliability coefficient can be seen as a result of the combined effect of the sampling of subjects and raters. However, all inter-rater...

Reliability of Body Landmarks Analyzer for Measuring the Quadriceps Angle

Genu varum and genu valgum are the most common postural deformities of the knee joint. The quadriceps angle is used to measure these anomalies. Methods of measuring this angle fall into two categories: invasive and non-invasive. The purpose of the present research was to study the inter- and intra-rater reliability of the non-invasive Body Landmarks Analyzer method for measuring the quadriceps...

A Comparison of Cohen's Kappa and Agreement Coefficients by Corrado Gini

The paper compares four coefficients that can be used to summarize inter-rater agreement on a nominal scale. The coefficients are Cohen's kappa and three coefficients that were originally proposed by the Italian statistician Corrado Gini. All four coefficients have zero value if the two nominal variables are statistically independent, and value unity if there is perfect agreement. The coefficie...

Confidence Intervals for Intraclass Correlation in Inter-Rater Reliability

Calculation of a confidence interval for intraclass correlation to assess inter-rater reliability is problematic when the number of raters is small and the rater effect is not negligible. Intervals produced by existing methods are uninformative: the lower bound is often close to zero, even in cases where the reliability is good and the sample size is large. In this paper, we show that t...

Psychometric properties of the Portuguese version of the Jebsen-Taylor test for adults with mild hemiparesis / Avaliação das propriedades psicométricas da versão em português do teste de Jebsen Taylor para adultos com hemiparesia leve

Objectives: To evaluate the psychometric properties of the Portuguese version of the Jebsen-Taylor Test (JTT) in patients with stroke. Methods: Forty participants who suffered a stroke in the cerebral hemisphere were videotaped while performing the JTT. Scores were defined by the time taken to perform the tasks, and two physical therapists evaluated the performance of the participants. Intra- and...

Volume 16

Publication date: 2016